Size is not Everything: Genre Balance in Bootstrapping a Swedish PoS Tagger

Author

  • Eva Forsbom
Abstract

Part-of-speech tagging is a basic component of natural language processing and therefore needs to be as accurate as possible, or any subsequent processing will suffer. For Swedish, most tagger models are trained on the Stockholm-Umeå Corpus (SUC; Ejerhed et al., 2006). As SUC is a balanced corpus, SUC models are better representatives of general language than models trained on news texts only, which is a common scenario for other languages. On the other hand, the corpus is a bit too small for tagger training, considering the size of the tagset needed to express the most common morphosyntactic features of Swedish. This leads to poorer performance than has been reported for, say, an equally sized English news corpus and a much smaller German news corpus, both tagged with the statistical TnT tagger (Brants, 2000). Both the English and German models show an accuracy of 96.7%, while the same tagger trained on SUC alone has an accuracy of 95.5%. As SUC is too small to be used alone as training data for a higher-accuracy tagger, we have used it as a seed corpus to bootstrap a much larger, unannotated corpus that can be added as training data. The bootstrapped corpus could represent another modality, domain or genre if we are looking for adaptation. This sort of bootstrapping process has proved to be a viable approach (cf. Forsbom, 2006; Merialdo, 1994; Nivre and Grönqvist, 2001; Sjöbergh, 2003). Here, we are interested in the effect that the genre balance of the bootstrapped corpus has on performance, broken down by SUC genre.
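The bootstrapping loop and the per-genre breakdown described in the abstract can be sketched roughly as follows. This is only an illustration of the data flow, not the paper's setup: the most-frequent-tag baseline stands in for a contextual tagger such as TnT, and the function names (`train`, `tag`, `bootstrap`, `accuracy_by_genre`) are hypothetical. A context-free baseline like this has little to gain from its own output, so no accuracy improvement should be expected from the sketch itself.

```python
from collections import Counter, defaultdict


def train(tagged_sents):
    """Most-frequent-tag baseline (an assumed stand-in for a real tagger)."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sent in tagged_sents:
        for word, tag_ in sent:
            counts[word.lower()][tag_] += 1
            all_tags[tag_] += 1
    # Unknown words fall back to the overall most frequent tag.
    default = all_tags.most_common(1)[0][0]
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return lexicon, default


def tag(model, words):
    """Tag a list of word forms with the trained model."""
    lexicon, default = model
    return [(w, lexicon.get(w.lower(), default)) for w in words]


def bootstrap(seed, unannotated):
    """Train on the seed corpus, tag the unannotated corpus automatically,
    then retrain on the seed plus the automatically tagged sentences."""
    seed_model = train(seed)
    auto_tagged = [tag(seed_model, sent) for sent in unannotated]
    return train(seed + auto_tagged)


def accuracy_by_genre(model, gold_by_genre):
    """Token accuracy per genre, given gold-tagged sentences keyed by genre."""
    result = {}
    for genre, sents in gold_by_genre.items():
        correct = total = 0
        for sent in sents:
            predicted = tag(model, [w for w, _ in sent])
            correct += sum(p == g for (_, p), (_, g) in zip(predicted, sent))
            total += len(sent)
        result[genre] = correct / total if total else 0.0
    return result
```

In a realistic experiment, `seed` would hold the validated corpus, `unannotated` the larger raw corpus, and `gold_by_genre` a held-out portion of the seed corpus split by genre label.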


Similar resources

Big is beautiful: Bootstrapping a PoS tagger for Swedish

A statistical part-of-speech tagger trained on a one-million word Swedish corpus with validated tags was used to tag two considerably larger untagged corpora (≈ 78 and 20 million words, respectively) to bootstrap new, improved, tagger models. The new taggers all showed better accuracy both for seen and unseen words, and the best tagger had 97.02% overall accuracy evaluated on the original corpu...


Stagger: A modern POS tagger for Swedish

The field of Part of Speech (POS) tagging has made slow but steady progress during the last decade, though many of the new methods developed have not previously been applied to Swedish. I present a new system, based on the Averaged Perceptron algorithm and semi-supervised learning, that is more accurate than previous Swedish POS taggers. Furthermore, a new version of the Stockholm-Umeå Corpus i...


Extending the View: Explorations in Bootstrapping a Swedish PoS Tagger

State-of-the-art statistical part-of-speech taggers mainly use information on tag bi- or trigrams, depending on the size of the training corpus. Some also use lexical emission probabilities above unigrams with beneficial results. In both cases, a wider context usually gives better accuracy for a large training corpus, which in turn gives better accuracy than a smaller one. Large corpora with vali...


A Comparative Study of the Effect of Part-of-Speech Tagging on Parsing in Automatic Processing of the Persian Language

In this paper, the role of Part-of-Speech (POS) tagging for parsing in automatic processing of the Persian language is studied. To this end, the impact of the quality of POS tagging as well as the impact of the quantity of information available in the POS tags on parsing are studied. To reach the goals, three parsing scenarios are proposed and compared. In the first scenario, the parser assigns...


Studying impressive parameters on the performance of Persian probabilistic context free grammar parser

In linguistics, a tree bank is a parsed text corpus that annotates syntactic or semantic sentence structure. The exploitation of tree bank data has been important ever since the first large-scale tree bank, The Penn Treebank, was published. However, although originating in computational linguistics, the value of tree bank is becoming more widely appreciated in linguistics research as a whole. F...



Publication date: 2008